Introduction

Dataset from the U.S. Small Business Administration (SBA). For this case-study assignment, students assume the role of loan officer at a bank and are asked to approve or deny a loan by assessing its risk of default using logistic regression.

Background and Description of Datasets

The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market. Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. One way SBA assists these small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses. SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.

There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans. The rate of default on these loans has been a source of controversy for decades. Conservative economists believe that credit markets perform efficiently without government participation. Supporters of SBA-guaranteed loans argue that the social benefits of job creation by those small businesses receiving government-guaranteed loans far outweigh the costs incurred from defaulted loans.

Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan. Therefore, banks are still faced with a difficult choice as to whether they should grant such a loan because of the high risk of default. One way to inform their decision making is through analyzing relevant historical data such as the datasets provided here.

https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o

In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import sys
sys.path.insert(0,'../')
from utils.paths import *
In [3]:
import pandas as pd
import numpy as np
# from plotly.offline import init_notebook_mode, iplot
# import cufflinks as cf
# init_notebook_mode()
# cf.go_offline()
from __future__ import division
In [4]:
path_data = '../large_data_files/ASA_loan_data/'
def table(no):
    # there are 13 additional table
    return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')
table(1)

Read data

In [5]:
nat = pd.read_csv(path_SBA + 'SBAnational.csv', low_memory=False)
In [32]:
# Preprocessing
import preprocessing as pp
reload(pp)
Out[32]:
<module 'preprocessing' from '../preprocessing.pyc'>
In [7]:
%%time
nat = nat[nat.ApprovalFY != '1976A']
nat['ApprovalFY'] = nat.ApprovalFY.astype(int)
# nat = nat[(nat.ApprovalFY >= 1990) & pd.notnull(nat.MIS_Status)]
nat = nat[pd.notnull(nat.MIS_Status)]
nat = nat[pd.notnull(nat.Name)]

# convert to timestamp
nat['ApprovalDate'] = pd.to_datetime(nat['ApprovalDate'], errors = 'coerce')
nat['DisbursementDate'] = pd.to_datetime(nat['DisbursementDate'], errors = 'coerce')

# convert $ to float
nat['DisbursementGross'] = nat['DisbursementGross'].apply(pp.to_float)
nat['BalanceGross'] = nat['BalanceGross'].apply(pp.to_float)
nat['ChgOffPrinGr'] = nat['ChgOffPrinGr'].apply(pp.to_float)
nat['GrAppv'] = nat['GrAppv'].apply(pp.to_float)
nat['SBA_Appv'] = nat['SBA_Appv'].apply(pp.to_float)

# Default
nat['default'] = nat.MIS_Status.apply(pp.default)
CPU times: user 5min 32s, sys: 1.02 s, total: 5min 33s
Wall time: 5min 33s
In [8]:
%%time
# cleaning data
nat['Zip5d'] = nat.apply(lambda x: pp.zip_5d(x['Zip'], x['State']), axis = 1)
nat['Zip3d'] = nat.Zip5d.str[:3]
nat['LowDoc'] = nat.LowDoc.apply(pp.clean_LowDoc)
nat['SBA_ratio'] = nat.SBA_Appv / nat.GrAppv
nat['RevLineCr'] = nat.RevLineCr.apply(pp.clean_RevLineCr)
nat['Zip_length'] = nat.Zip.apply(lambda x: len(str(x)))
nat['RealEstate'] = nat.Term.apply(pp.RealEstate)
nat['NAICS_default_rate'] = nat.NAICS.apply(pp.naics_defaut_rate)
nat['NAICS_group'] = nat.NAICS.apply(pp.naics_sector)
nat['FranchiseCode'] = nat.FranchiseCode.apply(pp.franchise)
CPU times: user 2min 18s, sys: 1.06 s, total: 2min 19s
Wall time: 2min 19s
In [14]:
# fix missing state
nat = pp.fix_missing_state(nat)
In [19]:
nat['Name2'] = nat['Name'] + '|' + nat['State'] + '(' + nat['Zip5d'].astype(str) + ')'
In [21]:
nat.Name2.sample(10)
Out[21]:
16473     Maria G. Abbas dba Earth Spiri|MD(21093)
283659               THE MEDICINE SHOPPE|CA(95361)
433950                   Indocurrent LLC|NY(11735)
584702    CHAMPAK/SHARDA BHOJA DBA DAYSI|GA(31206)
737823              MUSICAL PRODUCTS INC|IL(60616)
270061    THE AFTROBABIES COLLECTION INC|CA(90056)
871669        FLIP-FLOP COLLECTION, INC.|CA(90015)
820860                            SUBWAY|CA(94107)
815350                    PLATO'S CLOSET|TN(37421)
3901      Mahmoud Abdalla dba Texas Used|TX(76016)
Name: Name2, dtype: object
In [22]:
nat.Name2.value_counts().head()
Out[22]:
WEST TEXAS LEE CO INC|TX(79329)             134
Meathead Movers, Inc.|CA(93401)              42
REGENT FLOORS, INC.|PA(15137)                42
ECKERT DRILLING & CONSTRUCTION|PA(15235)     27
WEST TEXAS LEE COMPANY INC|TX(79329)         26
Name: Name2, dtype: int64
In [23]:
nat.Name.value_counts().head()
Out[23]:
SUBWAY                 1269
QUIZNO'S SUBS           433
COLD STONE CREAMERY     366
QUIZNO'S                345
DOMINO'S PIZZA          328
Name: Name, dtype: int64

Create features using historical records

In [24]:
# Create a record of company applied for loan

loan_record = {}
for i in nat.Name2.unique():
    loan_record[i] = []
# len(loan_record.keys())

for i in range(len(nat)):
#     print i
    loan_record[nat.iloc[i].Name2].append(nat.iloc[i].ApprovalFY)

loan_record_df = pd.DataFrame([loan_record]).T
loan_record_df = loan_record_df.rename(columns = {0: 'loan_list'})
loan_record_df['loan_start'] = loan_record_df.loan_list.apply(lambda x: min(x))
loan_record_df['loan_record_dict'] = loan_record_df.loan_list.apply(lambda x: pd.Series(x).value_counts().to_dict())

loan_record_df = loan_record_df.reset_index()
loan_record_df = loan_record_df.rename(columns = {'index': 'Name2'})
loan_record_df = loan_record_df.drop(loan_record_df[pd.isnull(loan_record_df.Name2)].index)
loan_record_df['Name'] = loan_record_df['Name2'].apply(lambda x: x.split('|')[0])

loan_record_df['suffix'] = loan_record_df['Name'].apply(pp.company_suffix)

display(loan_record_df.head(10))

save_csv(loan_record_df, 'extra_company_info.csv')
Name2 loan_list loan_start loan_record_dict Name suffix
0 "216" PIANO BAR|AR(72764) [2002] 2002 {2002: 1} "216" PIANO BAR MARKET
1 "360" SALON & DAY SPA|WV(26101) [2002] 2002 {2002: 1} "360" SALON & DAY SPA NO SUFFIX
2 "821"|DE(19801) [1999] 1999 {1999: 1} "821" NO SUFFIX
3 "A CLASS ACT" BEAUTY SALON|NV(89102) [1989] 1989 {1989: 1} "A CLASS ACT" BEAUTY SALON SALON
4 "A LA MODE" ICE CREAM PARLOUR|FL(34221) [1997] 1997 {1997: 1} "A LA MODE" ICE CREAM PARLOUR NO SUFFIX
5 "A MARKET" NATURAL FOODS|NH(03103) [1992] 1992 {1992: 1} "A MARKET" NATURAL FOODS NO SUFFIX
6 "A SHEAR DELITE" STYLING CENTE|MS(39180) [1996] 1996 {1996: 1} "A SHEAR DELITE" STYLING CENTE NO SUFFIX
7 "A" COMPANY, INC.|ID(83705) [1990] 1990 {1990: 1} "A" COMPANY, INC. INC
8 "A" YOGURT ENTERPRISES|TX(79902) [1989] 1989 {1989: 1} "A" YOGURT ENTERPRISES ENTERPRISES
9 "AI" TERIYAKI HOUSE|WA(98003) [1995] 1995 {1995: 1} "AI" TERIYAKI HOUSE NO SUFFIX
<---- Saving csv file to s3 ---->
In [25]:
# Create a record of company default

nat_d = nat[nat.default == 1].reset_index(drop = True)

default_record = {}
for i in nat_d.Name2.unique():
    default_record[i] = []

for i in range(len(nat_d)):
    default_record[nat_d.iloc[i].Name2].append(nat_d.iloc[i].ApprovalFY)

default_record_df = pd.DataFrame([default_record]).T
default_record_df = default_record_df.rename(columns = {0: 'default_list'})
default_record_df['default_record_dict'] = default_record_df.default_list.apply(lambda x: pd.Series(x).value_counts().to_dict())

default_record_df = default_record_df.reset_index()
default_record_df = default_record_df.rename(columns = {'index': 'Name2'})
default_record_df = default_record_df.drop(default_record_df[pd.isnull(default_record_df.Name2)].index)
default_record_df['Name'] = default_record_df['Name2'].apply(lambda x: x.split('|')[0])

display(default_record_df.head(10))

save_csv(default_record_df, 'company_default_record.csv')
Name2 default_list default_record_dict Name
0 "ALEXANDRA G" SHRIMP TRAWLER|TX(77465) [2001] {2001: 1} "ALEXANDRA G" SHRIMP TRAWLER
1 "B" SWEET|NJ(07728) [1998] {1998: 1} "B" SWEET
2 "E" STREET AUTOMOTIVE REPAIR|CA(95816) [1987] {1987: 1} "E" STREET AUTOMOTIVE REPAIR
3 "FORE" SEASONS, LLC|OH(44087) [2003] {2003: 1} "FORE" SEASONS, LLC
4 "HOOT" GIBSON SALES CO.|TX(78754) [1988] {1988: 1} "HOOT" GIBSON SALES CO.
5 "I" TEMP HVAC|TX(75228) [2003] {2003: 1} "I" TEMP HVAC
6 "JOSEPH E. ELLIOTT, JR. DBA|CT(06511) [1994] {1994: 1} "JOSEPH E. ELLIOTT, JR. DBA
7 "LADY JADE"|LA(70815) [2001] {2001: 1} "LADY JADE"
8 "SANUS" BOSNIAN CAFE AND GRILL|MN(55904) [2003] {2003: 1} "SANUS" BOSNIAN CAFE AND GRILL
9 "SJSH" LLC|UT(84604) [2003] {2003: 1} "SJSH" LLC
<---- Saving csv file to s3 ---->
In [26]:
# Open saved loan records
from ast import literal_eval

extinf = pd.read_csv(path_SBA + 'extra_company_info.csv', sep = ';', low_memory=False)
extinf['loan_record_dict'] = extinf['loan_record_dict'].apply(literal_eval)
extinf[extinf.loan_record_dict.apply(lambda x: len(x.keys())) > 3].head()
Out[26]:
Name2 loan_list loan_start loan_record_dict Name suffix
951 121 MARKETING SERVICES GROUP I|MN(55374) [2000, 2001, 2001, 2002, 2004, 2005] 2000 {2000: 1, 2001: 2, 2002: 1, 2004: 1, 2005: 1} 121 MARKETING SERVICES GROUP I NO SUFFIX
8903 A TO Z MACHINING SERVICE|OK(74601) [1998, 1999, 1991, 1994] 1991 {1994: 1, 1991: 1, 1998: 1, 1999: 1} A TO Z MACHINING SERVICE SERVICE
13795 ABENTROTH DISTRIBUTING, INC.|ND(58223) [2000, 2001, 2004, 2004, 2005] 2000 {2000: 1, 2001: 1, 2004: 2, 2005: 1} ABENTROTH DISTRIBUTING, INC. INC
14361 ABRAHAM STEEL FABRICATION INC|CA(93401) [2007, 2008, 2004, 2005] 2004 {2008: 1, 2004: 1, 2005: 1, 2007: 1} ABRAHAM STEEL FABRICATION INC INC
14402 ABRASIVES, INC.|ND(58533) [1998, 1998, 1999, 2001, 1991, 1994, 1995] 1991 {1991: 1, 1994: 1, 1995: 1, 1998: 2, 1999: 1, ... ABRASIVES, INC. INC
In [27]:
# Open saved default records
default_record = pd.read_csv(path_SBA + 'company_default_record.csv', sep = ';', low_memory=False)
default_record['default_record_dict'] = default_record['default_record_dict'].apply(literal_eval)
default_record[default_record.default_record_dict.apply(lambda x: len(x.keys())) > 3].head()
Out[27]:
Name2 default_list default_record_dict Name
24311 CASH & CARRY WHOLESALE, INC.|MO(65802) [2001, 2002, 2003, 2004, 2004, 2005] {2001: 1, 2002: 1, 2003: 1, 2004: 2, 2005: 1} CASH & CARRY WHOLESALE, INC.
26882 CITRUS CAFE, INC|CA(92780) [2007, 2008, 2009, 2010] {2008: 1, 2009: 1, 2010: 1, 2007: 1} CITRUS CAFE, INC
40505 DUTCH BOYS CARPET CLEANING INC|WI(53070) [1998, 2002, 2003, 2005] {2002: 1, 2003: 1, 2005: 1, 1998: 1} DUTCH BOYS CARPET CLEANING INC
50125 FORGET ME NOT FLORIST|LA(71108) [2006, 2007, 2008, 2005] {2008: 1, 2005: 1, 2006: 1, 2007: 1} FORGET ME NOT FLORIST
136391 TONY COMPTON AND TERRIE COMPTO|MS(39114) [2009, 2010, 2003, 2003, 2006] {2009: 1, 2010: 1, 2003: 2, 2006: 1} TONY COMPTON AND TERRIE COMPTO

Create features

In [28]:
nat = nat.merge(extinf[['Name2', 'loan_start', 'loan_record_dict', 'suffix']] , how = 'left', on = 'Name2')

nat['Loan_age'] = nat.apply(lambda x: pp.loan_age(x['ApprovalFY'], x['loan_record_dict']), axis = 1)
nat['Previous_loan'] = nat.apply(lambda x: pp.previous_loan(x['ApprovalFY'], x['loan_record_dict']), axis = 1)

nat = nat.merge(default_record[['Name2', 'default_record_dict']] , how = 'left', on = 'Name2')

nat['default_times'] = nat.apply(lambda x: pp.default_times(x['ApprovalFY'], x['default_record_dict']), axis = 1)
In [29]:
nat[['Loan_age', 'Previous_loan', 'default_times']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 897137 entries, 0 to 897136
Data columns (total 3 columns):
Loan_age         897137 non-null int64
Previous_loan    897137 non-null int64
default_times    897137 non-null int64
dtypes: int64(3)
memory usage: 27.4 MB
loan_record_df[loan_record_df.Suffix == 'NO SUFFIX'].Name.apply(lambda x: x.split()[-1].replace('.', '').upper()).value_counts().head(50)
In [34]:
nat.head()
Out[34]:
LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate ApprovalFY ... NAICS_default_rate NAICS_group Name2 loan_start loan_record_dict suffix Loan_age Previous_loan default_record_dict default_times
0 1000014003 ABC HOBBYCRAFT EVANSVILLE IN 47711 FIFTH THIRD BANK OH 451120 1997-02-28 1997 ... 23.0 44 ABC HOBBYCRAFT|IN(47711) 1997 {1997: 1} NO SUFFIX 0 0 NaN 0
1 1000024006 LANDMARK BAR & GRILLE (THE) NEW PARIS IN 46526 1ST SOURCE BANK IN 722410 1997-02-28 1997 ... 22.0 72 LANDMARK BAR & GRILLE (THE)|IN(46526) 1997 {1997: 1} NO SUFFIX 0 0 NaN 0
2 1000034009 WHITLOCK DDS, TODD M. BLOOMINGTON IN 47401 GRANT COUNTY STATE BANK IN 621210 1997-02-28 1997 ... 10.0 62 WHITLOCK DDS, TODD M.|IN(47401) 1997 {1997: 1} NO SUFFIX 0 0 NaN 0
3 1000044001 BIG BUCKS PAWN & JEWELRY, LLC BROKEN ARROW OK 74012 1ST NATL BK & TR CO OF BROKEN OK 0 1997-02-28 1997 ... NaN 0 BIG BUCKS PAWN & JEWELRY, LLC|OK(74012) 1997 {1997: 1} LLC 0 0 NaN 0
4 1000054004 ANASTASIA CONFECTIONS, INC. ORLANDO FL 32801 FLORIDA BUS. DEVEL CORP FL 0 1997-02-28 1997 ... NaN 0 ANASTASIA CONFECTIONS, INC.|FL(32801) 1997 {1997: 1} INC 0 0 NaN 0

5 rows × 43 columns

Exploratory data analysis

In [ ]:
nat.GrAppv.iplot(kind = 'hist', bins = 100, title = 'Histogram of grant approved', xTitle = 'Grant')
In [ ]:
# No. of loan each year
nat.groupby('ApprovalFY').count().max(1).iplot(kind = 'bar', title = 'Loan cases')

Default rate

In [ ]:
(nat[nat.default == 1].groupby('ApprovalFY').count().max(1) / 
 nat.groupby('ApprovalFY').count().max(1)).iplot(kind = 'bar', title = 'Default rate')
In [ ]:
nat[nat.default == 1].groupby('ApprovalFY').SBA_Appv.sum().iplot(kind = 'bar')
In [ ]:
nat[nat.default == 1].ChgOffPrinGr.iplot(kind = 'hist', bins = 50)
In [ ]:
nat.head().T
In [ ]:
nat.State.value_counts().head()
In [ ]:
# default rate
nat['default'].sum() / len(nat)
In [ ]:
nat.default.value_counts()

LowDoc

LowDoc (Y = Yes, N = No): In order to process more loans efficiently, a “LowDoc Loan” program was implemented where loans under $150,000 can be processed using a one-page application. “Yes” indicates loans with a one-page application, and “No” indicates loans with more information attached to the application. In this dataset, 87.31% are coded as N (No) and 12.31% as Y (Yes) for a total of 99.62%. It is worth noting that 0.38% have other values (0, 1, A, C, R, S); these are data entry errors.

In [ ]:
nat.LowDoc.sample(10)

NAICS

NAICS (North American Industry Classification System): This is a 2- through 6-digit hierarchical classification system used by Federal statistical agencies in classifying business establishments for the collection, analysis, and presentation of statistical data describing the U.S. economy. The first two digits of the NAICS classification represent the economic sector.

In [ ]:
table(3)
In [ ]:
nat.NAICS.value_counts().head()
In [ ]:
# nat.NAICS.apply(lambda x: len(str(x))).value_counts()

    
In [ ]:
nat['NAICS_group'].value_counts().sort_index()

Loans Backed by Real Estate

Whether a loan is backed by real estate (possession of land) is another risk indicator that is discussed. The rationale for this indicator is that the value of the land is often large enough to cover the amount of any principal outstanding, thereby reducing the probability of default.

Since the term of the loan is a function of the expected lifetime of the assets, loans backed by real estate will have terms 20 years or greater (≥240 months) and are the only loans granted for such a long term, whereas loans not backed by real estate will have terms less than 20 years (<240 months). Therefore, the authors created a dummy variable, “RealEstate,” where “RealEstate” = 1 if “Term” ≥240 months and “RealEstate” = 0 if “Term” <240 months.

In [29]:
nat.head()
Out[29]:
LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate ApprovalFY ... DisbursementGross BalanceGross MIS_Status ChgOffPrinGr GrAppv SBA_Appv default NAICS_group NAICS_default_rate RealEstate
0 1000014003 ABC HOBBYCRAFT EVANSVILLE IN 47711 FIFTH THIRD BANK OH 451120 1997-02-28 1997 ... 60000.0 0.0 P I F 0.0 60000.0 48000.0 0 44 23.0 0
1 1000024006 LANDMARK BAR & GRILLE (THE) NEW PARIS IN 46526 1ST SOURCE BANK IN 722410 1997-02-28 1997 ... 40000.0 0.0 P I F 0.0 40000.0 32000.0 0 72 22.0 0
2 1000034009 WHITLOCK DDS, TODD M. BLOOMINGTON IN 47401 GRANT COUNTY STATE BANK IN 621210 1997-02-28 1997 ... 287000.0 0.0 P I F 0.0 287000.0 215250.0 0 62 10.0 0
3 1000044001 BIG BUCKS PAWN & JEWELRY, LLC BROKEN ARROW OK 74012 1ST NATL BK & TR CO OF BROKEN OK 0 1997-02-28 1997 ... 35000.0 0.0 P I F 0.0 35000.0 28000.0 0 0 NaN 0
4 1000054004 ANASTASIA CONFECTIONS, INC. ORLANDO FL 32801 FLORIDA BUS. DEVEL CORP FL 0 1997-02-28 1997 ... 229000.0 0.0 P I F 0.0 229000.0 229000.0 0 0 NaN 1

5 rows × 31 columns

SBA's Guaranteed Portion of Approved Loan

The portion which is the percentage of the loan that is guaranteed by SBA (represented as “Portion” in the dataset) is a final risk indicator that is discussed in our courses. This is one of the variables that the authors generated calculating the ratio of the amount of the loan SBA guarantees and the gross amount approved by the bank (SBA_Appv/GrAppv)

In [30]:
nat.SBA_ratio.iplot(kind = 'hist', bins = 10, title = 'Histogram of SBA approved ratio', xTitle = 'SBA_ratio')
In [31]:
nat[nat.default == 1].SBA_ratio.iplot(kind = 'hist', bins = 10)
In [32]:
nat.groupby('default').SBA_ratio.mean()
Out[32]:
default
0    0.718599
1    0.632673
Name: SBA_ratio, dtype: float64
In [33]:
nat.head().T
Out[33]:
0 1 2 3 4
LoanNr_ChkDgt 1000014003 1000024006 1000034009 1000044001 1000054004
Name ABC HOBBYCRAFT LANDMARK BAR & GRILLE (THE) WHITLOCK DDS, TODD M. BIG BUCKS PAWN & JEWELRY, LLC ANASTASIA CONFECTIONS, INC.
City EVANSVILLE NEW PARIS BLOOMINGTON BROKEN ARROW ORLANDO
State IN IN IN OK FL
Zip 47711 46526 47401 74012 32801
Bank FIFTH THIRD BANK 1ST SOURCE BANK GRANT COUNTY STATE BANK 1ST NATL BK & TR CO OF BROKEN FLORIDA BUS. DEVEL CORP
BankState OH IN IN OK FL
NAICS 451120 722410 621210 0 0
ApprovalDate 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00 1997-02-28 00:00:00
ApprovalFY 1997 1997 1997 1997 1997
Term 84 60 180 60 240
NoEmp 4 2 7 2 14
NewExist 2 2 1 1 1
CreateJob 0 0 0 0 7
RetainedJob 0 0 0 0 7
FranchiseCode 1 1 1 1 1
UrbanRural 0 0 0 0 0
RevLineCr N N N N N
LowDoc 1 1 0 1 0
ChgOffDate NaN NaN NaN NaN NaN
DisbursementDate 1999-02-28 00:00:00 1997-05-31 00:00:00 1997-12-31 00:00:00 1997-06-30 00:00:00 1997-05-14 00:00:00
DisbursementGross 60000 40000 287000 35000 229000
BalanceGross 0 0 0 0 0
MIS_Status P I F P I F P I F P I F P I F
ChgOffPrinGr 0 0 0 0 0
GrAppv 60000 40000 287000 35000 229000
SBA_Appv 48000 32000 215250 28000 229000
default 0 0 0 0 0
NAICS_group 44 72 62 0 0
NAICS_default_rate 23 22 10 NaN NaN
RealEstate 0 0 0 0 1
SBA_ratio 0.8 0.8 0.75 0.8 1

RevLineCr

Revolving Line of Credit : Y = Yes

In [34]:
nat.RevLineCr.value_counts()
Out[34]:
N    391402
0    257431
Y    200631
T     15239
1        23
R        14
`        11
2         6
C         2
5         1
7         1
4         1
-         1
A         1
Q         1
3         1
,         1
.         1
Name: RevLineCr, dtype: int64

Zip code

A ZIP Code is a postal code used by the United States Postal Service (USPS) in a system it introduced in 1963.

The first digit of the ZIP Code is allocated as follows:

0 = Connecticut (CT), Massachusetts (MA), Maine (ME), New Hampshire (NH), New Jersey (NJ), New York (NY, Fishers Island only), Puerto Rico (PR), Rhode Island (RI), Vermont (VT), Virgin Islands (VI), Army Post Office Europe (AE), Fleet Post Office Europe (AE) 1 = Delaware (DE), New York (NY), Pennsylvania (PA) 2 = District of Columbia (DC), Maryland (MD), North Carolina (NC), South Carolina (SC), Virginia (VA), West Virginia (WV) 3 = Alabama (AL), Florida (FL), Georgia (GA), Mississippi (MS), Tennessee (TN), Army Post Office Americas (AA), Fleet Post Office Americas (AA) 4 = Indiana (IN), Kentucky (KY), Michigan (MI), Ohio (OH) 5 = Iowa (IA), Minnesota (MN), Montana (MT), North Dakota (ND), South Dakota (SD), Wisconsin (WI) 6 = Illinois (IL), Kansas (KS), Missouri (MO), Nebraska (NE) 7 = Arkansas (AR), Louisiana (LA), Oklahoma (OK), Texas (TX) 8 = Arizona (AZ), Colorado (CO), Idaho (ID), New Mexico (NM), Nevada (NV), Utah (UT), Wyoming (WY) 9 = Alaska (AK), American Samoa (AS), California (CA), Guam (GU), Hawaii (HI), Marshall Islands (MH), Federated States of Micronesia (FM), Northern Mariana Islands (MP), Oregon (OR), Palau (PW), Washington (WA), Army Post Office Pacific (AP), Fleet Post Office Pacific (AP)

The next two digits represent the sectional center facility (SCF) (e.g. 477xx = Vanderburgh County, Indiana), and the fourth and fifth digits represent the area of the city (if in a metropolitan area), or a village/town (outside metro areas): 47722 (4=Indiana, 77=Vanderburgh County, 22=University of Evansville area). When a sectional center facility's area crosses state lines, that facility is assigned separate three-digit prefixes for the states that it serves.

(https://en.wikipedia.org/wiki/ZIP_Code)

Connecticut (CT) Massachusetts (MA) Maine (ME) New Hampshire (NH) New Jersey (NJ) New York (NY) Puerto Rico (PR) Rhode Island (RI) Vermont (VT) Virgin Islands (VI) Army Post Office Europe (AE) Fleet Post Office Europe (AE)

'CT','MA','ME','NH','NJ','NY','PR','RI','VT','VI','AE','AE'

In [36]:
nat['Zip_length'].value_counts()
Out[36]:
5    777484
4     91392
1       377
3        33
2         2
Name: Zip_length, dtype: int64

Seem like some of the zip code has missing values!

In [38]:
nat[nat['Zip5d'] == '99999'].shape
Out[38]:
(828, 34)
In [40]:
nat['Zip5d'].value_counts().head()
Out[40]:
10001    924
90015    918
99999    828
93401    744
90010    732
Name: Zip5d, dtype: int64
In [41]:
nat['Zip3d'].value_counts().head()
Out[41]:
900    12286
770    10073
840     7916
750     7824
917     6962
Name: Zip3d, dtype: int64
In [42]:
nat = nat[['LoanNr_ChkDgt', 'Name', 'City', 'State', 'Zip5d', 'Zip3d', 'Bank', 'BankState', 'NAICS', 'NAICS_group', 
           'NAICS_default_rate', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 
           'FranchiseCode', 'UrbanRural', 'RevLineCr', 'LowDoc', 'ChgOffDate', 'DisbursementDate', 'DisbursementGross', 
           'BalanceGross', 'MIS_Status', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv', 'default', 'RealEstate', 'SBA_ratio']]
In [46]:
pd.isnull(nat).sum()
Out[46]:
LoanNr_ChkDgt              0
Name                      12
City                       0
State                      8
Zip5d                      0
Zip3d                      0
Bank                    1037
BankState               1043
NAICS                      0
NAICS_group                0
NAICS_default_rate    181207
ApprovalDate               0
ApprovalFY                 0
Term                       0
NoEmp                      0
NewExist                 134
CreateJob                  0
RetainedJob                0
FranchiseCode              0
UrbanRural                 0
RevLineCr                  0
LowDoc                  5999
ChgOffDate            714886
DisbursementDate        2011
DisbursementGross          0
BalanceGross               0
MIS_Status                 0
ChgOffPrinGr               0
GrAppv                     0
SBA_Appv                   0
default                    0
RealEstate                 0
SBA_ratio                  0
dtype: int64

Saving datasets

In [37]:
!aws s3 ls --human-readable s3://eh-home/ehda-calvin/SBA_study/
                           PRE pdf/
2018-10-29 08:48:12   15.8 KiB 7a_504_FOIA Data Dictionary.xlsx
2018-10-29 08:48:13   28.9 MiB FOIA - 504 (FY1991-Present).xlsx
2018-10-29 08:48:15   52.6 MiB FOIA - 7(a)(FY1991-FY1999).xlsx
2018-10-29 08:48:17  111.1 MiB FOIA - 7(a)(FY2000-FY2009).xlsx
2018-10-29 08:48:19   80.8 MiB FOIA - 7(a)(FY2010-Present).xlsx
2018-10-26 03:13:21  171.1 MiB SBAnational.csv
2018-11-01 07:21:49  232.1 MiB SBAnational_new.csv
2018-11-01 07:11:23   10.4 MiB company_default_record.csv
2018-11-01 07:03:29   67.9 MiB extra_company_info.csv
2018-10-29 03:22:11    9.6 MiB loan_record.csv
2018-10-29 02:42:40    1.4 KiB t0001-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:41  772 Bytes t0002-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:42  785 Bytes t0003-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43  913 Bytes t0004-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43  202 Bytes t0005-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:44  203 Bytes t0006-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:45  289 Bytes t0007-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:46  408 Bytes t0008-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:47  204 Bytes t0009-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:48  294 Bytes t0010-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:49  250 Bytes t0011-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50  316 Bytes t0012-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50    2.3 KiB t0013-10.1080%2F10691898.2018.1434342.csv
In [35]:
save_csv(nat, 'SBAnational_new.csv')
<---- Saving csv file to s3 ---->
In [39]:
df = pd.read_excel(path_SBA + "FOIA - 7(a)(FY2000-FY2009).xlsx")
df.head()
Out[39]:
AsOfDate Program BorrName BorrStreet BorrCity BorrState BorrZip BankName BankStreet BankCity ... ProjectCounty ProjectState SBADistrictOffice CongressionalDistrict BusinessType LoanStatus ChargeOffDate GrossChargeOffAmount RevolverStatus JobsSupported
0 20180930 7A CREATIVE SUNROOM DESIGNS INC 9333 CASTLEGATE DR INDIANAPOLIS IN 46256 Busey Bank 100 W University Ave CHAMPAIGN ... MARION IN INDIANA DISTRICT OFFICE 5.0 CORPORATION PIF NaT 0 0 19
1 20180930 7A Tan Nhu Tieu PARKER AND COIT RD. PLANO TX 75023 Gulf Coast Bank and Trust Company 200 St. Charles Ave NEW ORLEANS ... COLLIN TX DALLAS / FT WORTH DISTRICT OFFICE 3.0 CORPORATION PIF NaT 0 0 16
2 20180930 7A DESIGNER CHOICE 5530 W. ADAMS BLVD. LOS ANGELES CA 90016 Bank of Hope 3200 Wilshire Blvd, Ste 1400 LOS ANGELES ... LOS ANGELES CA LOS ANGELES DISTRICT OFFICE 37.0 INDIVIDUAL CANCLD NaT 0 0 0
3 20180930 7A K & L AUTOMOTIVE REPAIR INC 1593 ELIZABETH AVE RAHWAY NJ 7065 Popular Bank 11 W 51st St NEW YORK CITY ... UNION NJ NEW JERSEY DISTRICT OFFICE 6.0 CORPORATION PIF NaT 0 0 4
4 20180930 7A RANDY'S STEAK HOUSE 7026 MAIN FRISCO TX 75034 Northeast Bank 500 Canal St LEWISTON ... COLLIN TX DALLAS / FT WORTH DISTRICT OFFICE 3.0 CORPORATION PIF NaT 0 0 23

5 rows × 35 columns

In [40]:
df.head().T
Out[40]:
0 1 2 3 4
AsOfDate 20180930 20180930 20180930 20180930 20180930
Program 7A 7A 7A 7A 7A
BorrName CREATIVE SUNROOM DESIGNS INC Tan Nhu Tieu DESIGNER CHOICE K & L AUTOMOTIVE REPAIR INC RANDY'S STEAK HOUSE
BorrStreet 9333 CASTLEGATE DR PARKER AND COIT RD. 5530 W. ADAMS BLVD. 1593 ELIZABETH AVE 7026 MAIN
BorrCity INDIANAPOLIS PLANO LOS ANGELES RAHWAY FRISCO
BorrState IN TX CA NJ TX
BorrZip 46256 75023 90016 7065 75034
BankName Busey Bank Gulf Coast Bank and Trust Company Bank of Hope Popular Bank Northeast Bank
BankStreet 100 W University Ave 200 St. Charles Ave 3200 Wilshire Blvd, Ste 1400 11 W 51st St 500 Canal St
BankCity CHAMPAIGN NEW ORLEANS LOS ANGELES NEW YORK CITY LEWISTON
BankState IL LA CA NY ME
BankZip 61824 70130 90010 10019 04240
GrossApproval 400000 1710600 150000 175000 380000
SBAGuaranteedApproval 300000 750000 112500 131250 285000
ApprovalDate 1999-10-01 00:00:00 1999-10-01 00:00:00 1999-10-01 00:00:00 1999-10-01 00:00:00 1999-10-01 00:00:00
ApprovalFiscalYear 2000 2000 2000 2000 2000
FirstDisbursementDate 1999-11-30 00:00:00 2001-04-30 00:00:00 NaT 2000-02-29 00:00:00 1999-10-31 00:00:00
DeliveryMethod PLP PLP OTH 7A PLP PLP
subpgmdesc Guaranty Guaranty Guaranty Guaranty Guaranty
InitialInterestRate NaN NaN NaN NaN NaN
TermInMonths 240 306 120 288 240
NaicsCode 235610 812320 NaN 811121 722110
NaicsDescription Roofing, Siding, and Sheet Metal Contractors Drycleaning and Laundry Services (except Coin-... NaN Automotive Body, Paint, and Interior Repair an... Full-Service Restaurants
FranchiseCode NaN NaN NaN NaN NaN
FranchiseName NaN NaN NaN NaN NaN
ProjectCounty MARION COLLIN LOS ANGELES UNION COLLIN
ProjectState IN TX CA NJ TX
SBADistrictOffice INDIANA DISTRICT OFFICE DALLAS / FT WORTH DISTRICT OFFICE LOS ANGELES DISTRICT OFFICE NEW JERSEY DISTRICT OFFICE DALLAS / FT WORTH DISTRICT OFFICE
CongressionalDistrict 5 3 37 6 3
BusinessType CORPORATION CORPORATION INDIVIDUAL CORPORATION CORPORATION
LoanStatus PIF PIF CANCLD PIF PIF
ChargeOffDate NaT NaT NaT NaT NaT
GrossChargeOffAmount 0 0 0 0 0
RevolverStatus 0 0 0 0 0
JobsSupported 19 16 0 4 23
In [41]:
df.BusinessType.value_counts()
Out[41]:
CORPORATION    501676
INDIVIDUAL     168399
PARTNERSHIP     20227
Name: BusinessType, dtype: int64
In [43]:
nat.Name.isin(df.BorrName).sum()
Out[43]:
528467
In [44]:
nat.shape
Out[44]:
(897137, 43)
In [46]:
df.RevolverStatus.value_counts()
Out[46]:
0    468610
1    221737
Name: RevolverStatus, dtype: int64
In [50]:
df.head()
Out[50]:
AsOfDate Program BorrName BorrStreet BorrCity BorrState BorrZip BankName BankStreet BankCity ... ProjectCounty ProjectState SBADistrictOffice CongressionalDistrict BusinessType LoanStatus ChargeOffDate GrossChargeOffAmount RevolverStatus JobsSupported
0 20180930 7A CREATIVE SUNROOM DESIGNS INC 9333 CASTLEGATE DR INDIANAPOLIS IN 46256 Busey Bank 100 W University Ave CHAMPAIGN ... MARION IN INDIANA DISTRICT OFFICE 5.0 CORPORATION PIF NaT 0 0 19
1 20180930 7A Tan Nhu Tieu PARKER AND COIT RD. PLANO TX 75023 Gulf Coast Bank and Trust Company 200 St. Charles Ave NEW ORLEANS ... COLLIN TX DALLAS / FT WORTH DISTRICT OFFICE 3.0 CORPORATION PIF NaT 0 0 16
2 20180930 7A DESIGNER CHOICE 5530 W. ADAMS BLVD. LOS ANGELES CA 90016 Bank of Hope 3200 Wilshire Blvd, Ste 1400 LOS ANGELES ... LOS ANGELES CA LOS ANGELES DISTRICT OFFICE 37.0 INDIVIDUAL CANCLD NaT 0 0 0
3 20180930 7A K & L AUTOMOTIVE REPAIR INC 1593 ELIZABETH AVE RAHWAY NJ 7065 Popular Bank 11 W 51st St NEW YORK CITY ... UNION NJ NEW JERSEY DISTRICT OFFICE 6.0 CORPORATION PIF NaT 0 0 4
4 20180930 7A RANDY'S STEAK HOUSE 7026 MAIN FRISCO TX 75034 Northeast Bank 500 Canal St LEWISTON ... COLLIN TX DALLAS / FT WORTH DISTRICT OFFICE 3.0 CORPORATION PIF NaT 0 0 23

5 rows × 35 columns

In [51]:
df['Name2'] = df['BorrName'] + '|' + df['BorrState'] + '(' + df['BorrZip'].astype(str) + ')'
In [54]:
df[df['Name2'] == 'Meathead Movers, Inc.|CA(93401)'].BusinessType
Out[54]:
252499    CORPORATION
254757    CORPORATION
278299    CORPORATION
352911    CORPORATION
378579    CORPORATION
378583    CORPORATION
378588    CORPORATION
378589    CORPORATION
378606    CORPORATION
378646    CORPORATION
378694    CORPORATION
378695    CORPORATION
378696    CORPORATION
378698    CORPORATION
378706    CORPORATION
378709    CORPORATION
378711    CORPORATION
381706    CORPORATION
381708    CORPORATION
381710    CORPORATION
397108    CORPORATION
398646    CORPORATION
398648    CORPORATION
400068    CORPORATION
435217    CORPORATION
436777    CORPORATION
452469    CORPORATION
496847    CORPORATION
496870    CORPORATION
509632    CORPORATION
             ...     
602623    CORPORATION
602624    CORPORATION
602627    CORPORATION
602628    CORPORATION
602629    CORPORATION
602630    CORPORATION
602631    CORPORATION
602632    CORPORATION
602633    CORPORATION
602634    CORPORATION
602635    CORPORATION
602636    CORPORATION
602637    CORPORATION
602638    CORPORATION
602639    CORPORATION
624276    CORPORATION
630089    CORPORATION
660636    CORPORATION
660637    CORPORATION
660638    CORPORATION
660639    CORPORATION
660640    CORPORATION
660641    CORPORATION
660642    CORPORATION
660643    CORPORATION
660644    CORPORATION
660645    CORPORATION
660646    CORPORATION
660647    CORPORATION
661503    CORPORATION
Name: BusinessType, Length: 93, dtype: object
In [47]:
df.BorrName.value_counts()
Out[47]:
SUBWAY                            1187
QUIZNO'S SUBS                      562
COLD STONE CREAMERY                511
THE UPS STORE                      446
QUIZNO'S                           398
DUNKIN DONUTS                      271
DAIRY QUEEN                        213
MAIL BOXES ETC                     212
MATCO TOOLS                        208
QUIZNO'S CLASSIC SUBS              203
CURVES FOR WOMEN                   196
DOMINO'S PIZZA                     189
WINGSTOP                           186
SPORT CLIPS                        176
PLANET BEACH                       159
GREAT CLIPS                        141
QUIZNOS SUBS                       131
FANTASTIC SAMS                     128
MASSAGE ENVY                       126
MARBLE SLAB CREAMERY               120
SMOOTHIE KING                      120
QUIZNOS                            120
CURVES                             120
MINUTEMAN PRESS                    115
HUNTINGTON LEARNING CENTER         115
HOLLYWOOD TANS                     114
EDIBLE ARRANGEMENTS                113
DAYS INN                           101
CICI'S PIZZA                        99
COTTMAN TRANSMISSION                99
                                  ... 
Dynamic Movers Inc.                  1
Jan's Little Angels Inc              1
CRANE ACTION, LLC                    1
JOSEPH BISIGNARO DBA JAB MECHA       1
LIGHTHOUSE REALTY GROUP INC          1
MARSHALL TEAL ENTERPRISES INC        1
XUNDA A. GIBSON M.D. INC             1
DRY CLEAN SUPER CENTER ON FM72       1
TAIPAN CAFE                          1
CENTRE FOR PREVENTIVE MEDICINE       1
Jagadish R. Patel dba Whitneyv       1
SUSHL UMI                            1
PET EMPORIUM                         1
Midwest Office Products              1
Prashanthi Vadhi DDS Inc             1
James J. Brooks Inc.                 1
LUI TECHNICAL SERVICES INC           1
DAEDALUS INC                         1
CASEY JONES RESTAURANT               1
JAMES L GREER                        1
THE PEANUT IN GLADSTONE, INC.        1
Charles Pattman                      1
ANGULAR MACHINING INC                1
MIDLANTIC BUILDING SYSTEMS LLC       1
Samuel R. Bradshaw                   1
The Cleaning Machine, Inc.           1
THE WASH HOUSE OF PEARL, INC.        1
BELLA NONNA RISTORANTE ITALIAN       1
Go Fishiing LLC                      1
BTC CORP OF USA                      1
Name: BorrName, Length: 592778, dtype: int64
In [ ]:
df = pd.DataFrame()
for f in ['FOIA - 7(a)(FY1991-FY1999).xlsx', 
          'FOIA - 7(a)(FY2000-FY2009).xlsx', 
          'FOIA - 7(a)(FY2010-Present).xlsx']:
    temp = pd.read_excel(path_SBA + f)
    temp = temp[(pd.notnull(temp.BorrName)) & (pd.notnull(temp.BorrState)) & 
                (pd.notnull(temp.BorrZip)) & (temp.Program == '7A')]
    temp = temp[['BorrName', 'BorrState', 'BorrZip', 'BusinessType']]
    df = pd.concat([df, temp])

dfs = pd.read_excel(path_SBA + "FOIA - 7(a)(FY2000-FY2009).xlsx")